
perf: cap planner budget when model dwarfs the streaming budget #1612

Merged
leejet merged 5 commits into leejet:master from fszontagh:perf/smaller-merged-segments on Jun 8, 2026

@fszontagh
Contributor

Summary

When the model is much bigger than --max-vram, the planner currently merges the base segments into 1-2 huge merged segments. The follow-up worst_merged_segment_footprint reservation in compute_streaming_segments then leaves almost nothing for chunk-K residency, so on a model like Z-Image bf16 (11.7 GB on a 12 GB GPU) chunk-K stays near 0 and every sampling step re-uploads the full model.

When the model fits comfortably in the budget the current behaviour is already optimal: one big merged segment, chunk-K covers the whole model, no per-step H2D.

This PR caps the budget passed to resolve_plan at a quarter of the streaming budget when total_params_bytes > 0.75 * effective_budget; otherwise it passes the full budget. Smaller merged segments shrink the worst_merged_segment_footprint reservation downstream, which frees enough residency budget for a meaningful chunk-K. Small/quantized models are unaffected.
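
For readers skimming the thread, the rule described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the function name `planner_budget` is hypothetical, and the parameter names are borrowed from the description. The byte values in the usage note are made-up round numbers.

```cpp
#include <cstdint>

// Hypothetical helper (not the PR's actual code): pick the budget that would
// be handed to the planner, following the rule stated in the description.
inline std::uint64_t planner_budget(std::uint64_t total_params_bytes,
                                    std::uint64_t effective_budget) {
    // "Model dwarfs the budget" is taken to mean: weights exceed 75% of it.
    const bool model_dwarfs_budget =
        static_cast<double>(total_params_bytes) >
        0.75 * static_cast<double>(effective_budget);

    // Large model: plan against a quarter of the budget so merged segments
    // stay small. Otherwise: plan against the full budget, as before.
    return model_dwarfs_budget ? effective_budget / 4 : effective_budget;
}
```

With an illustrative 12,000,000,000-byte budget, an 11,700,000,000-byte model would be planned against 3,000,000,000 bytes, while a 5,000,000,000-byte model would keep the full budget.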

Quantization-aware: total_params_bytes is summed via ggml_nbytes, so a Q8/Q4 model is correctly identified as small relative to the budget.
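
A rough sketch of why a byte-based sum is quantization-aware, assuming a simplified stand-in for ggml's type table. `TypeLayout`, `TensorDesc`, `tensor_bytes`, and `sum_params_bytes` are hypothetical names; the real code calls ggml_nbytes on actual tensors. The block layouts used here (bf16: 2 bytes per element; Q8_0: 34 bytes per 32 elements; Q4_0: 18 bytes per 32 elements) match ggml's definitions, and the sketch assumes element counts are multiples of the block size.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for a tensor type: elements per block, bytes per block.
struct TypeLayout {
    std::uint64_t block_elems;
    std::uint64_t block_bytes;
};

constexpr TypeLayout kBF16{1, 2};    // 2 bytes per element
constexpr TypeLayout kQ8_0{32, 34};  // fp16 scale + 32 int8 values
constexpr TypeLayout kQ4_0{32, 18};  // fp16 scale + 32 packed 4-bit values

// Hypothetical tensor descriptor; assumes n_elements % block_elems == 0.
struct TensorDesc {
    std::uint64_t n_elements;
    TypeLayout type;
};

// Storage size of one tensor in bytes under the simplified layout above.
inline std::uint64_t tensor_bytes(const TensorDesc& t) {
    return t.n_elements / t.type.block_elems * t.type.block_bytes;
}

// Sum of storage sizes: the quantity compared against the budget.
inline std::uint64_t sum_params_bytes(const std::vector<TensorDesc>& tensors) {
    std::uint64_t total = 0;
    for (const TensorDesc& t : tensors) {
        total += tensor_bytes(t);
    }
    return total;
}
```

Because the sum counts stored bytes rather than parameters, the same 1024-element tensor contributes 2048 bytes as bf16 but only 1088 bytes as Q8_0 and 576 bytes as Q4_0.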

Related Issue / Discussion

Continuation of the streaming-budget series #1576, #1598, #1601, #1611.

Additional Information

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Before | After |
| --- | --- | --- |
| SDXL bf16 1152x896 batch=2 8 steps | 21.6 s | 20.9 s |
| Z-Image bf16 1024x688 batch=2 9 steps | 138 s | 98 s |

SDXL stays in the full-budget path so the plan is unchanged (1 merged segment, whole UNet resident). Z-Image takes the T/4 path, the planner produces 9-11 merged segments instead of 2, and chunk-K grows from ~0 to ~3.5 GB, dropping per-step H2D enough for a ~30% wallclock win.

We tested T/2, T/3, T/5, T/6, T/8 on the Z-Image workload; T/4 is the empirical optimum on the test hardware (smaller fractions trade chunk-K headroom for per-dispatch overhead and regress).

Checklist

@leejet
Owner

leejet commented Jun 7, 2026

The heuristic makes sense for --stream-layers, but this currently applies before the stream_layers_enabled branch, so large models can get effective_budget / 4 planner merges even when streaming is disabled. That changes non-streaming behavior and may increase dispatch/segment overhead without providing chunk-K residency benefits. Can we gate the cap on stream_layers_enabled?

@fszontagh
Contributor Author

Good catch - gated on stream_layers_enabled in 88a5ee4. Non-streaming path now keeps the full effective_budget.
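
A minimal sketch of what the gated behaviour discussed in this exchange might look like, assuming the cap is a pure function of three inputs. The function name `planner_budget_gated` is hypothetical and this is not the code from commit 88a5ee4; only the identifier names stream_layers_enabled, total_params_bytes, and effective_budget come from the thread.

```cpp
#include <cstdint>

// Hypothetical sketch (not the merged code): apply the quarter-budget cap
// only when layer streaming is enabled.
inline std::uint64_t planner_budget_gated(std::uint64_t total_params_bytes,
                                          std::uint64_t effective_budget,
                                          bool stream_layers_enabled) {
    // Non-streaming path: keep the full budget so existing plans are unchanged.
    if (!stream_layers_enabled) {
        return effective_budget;
    }

    // Streaming path: cap only when the weights exceed 75% of the budget.
    const bool model_dwarfs_budget =
        static_cast<double>(total_params_bytes) >
        0.75 * static_cast<double>(effective_budget);
    return model_dwarfs_budget ? effective_budget / 4 : effective_budget;
}
```

Under this sketch, a large model with streaming disabled gets the full budget, which is the non-streaming behaviour the review asked to preserve.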

@leejet leejet merged commit 17a2b4a into leejet:master Jun 8, 2026
15 checks passed
2 participants